This is a starter RMarkdown template to accompany Data Visualization (Princeton University Press, 2019). You can use it to take notes, write your code, and produce a good-looking, reproducible document that records the work you have done. At the very top of the file is a section of metadata, or information about what the file is and what it does. The metadata is delimited by three dashes at the start and another three at the end. You should change the title, author, and date to the values that suit you. Keep the output line as it is for now, however. Each line in the metadata has a structure: first the key (“title”, “author”, etc.), then a colon, and then the value associated with the key.
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks. A code chunk is a specially delimited section of the file. You can add one by moving the cursor to a blank line and choosing Code > Insert Chunk from the RStudio menu, or by pressing Ctrl + Alt + I (Cmd + Option + I on a Mac). When you do, an empty chunk will appear:
Code chunks are delimited by three backticks (found to the left of the 1 key on US and UK keyboards) at the start and end. The opening backticks are followed by a pair of braces containing the letter r, to indicate what language the chunk is written in. You write your code inside the code chunks. Write your notes and other material around them, as here.
To install the tidyverse, make sure you have an Internet connection. Then manually run the code in the chunk below. If you knit the document, it will be skipped. We do this because you only need to install these packages once, not every time you run this file. Either run the chunk using the little green “play” arrow to the right of the chunk area, or copy and paste the text into the console window.
## This code will not be evaluated automatically.
## (Notice the eval = FALSE declaration in the options section of the
## code chunk)
my_packages <- c("tidyverse", "broom", "coefplot", "cowplot",
"gapminder", "GGally", "ggrepel", "ggridges", "gridExtra",
"here", "interplot", "margins", "maps", "mapproj",
"mapdata", "MASS", "quantreg", "rlang", "scales",
"survey", "srvyr", "viridis", "viridisLite", "devtools")
install.packages(my_packages, repos = "http://cran.rstudio.com")
We also need to download the socviz library from GitHub.
devtools::install_github("kjhealy/socviz")
To begin we must load some libraries we will be using. If we do not load them, R will not be able to find the functions contained in these libraries. The tidyverse includes ggplot and other tools. We also load the socviz and gapminder libraries.
Notice that here, the braces at the start of the code chunk have some additional options set in them. There is the language, r, as before. This is required. Then there is the word setup, which is a label for your code chunk. Labels are useful to briefly say what the chunk does. Label names must be unique (no two chunks in the same document can have the same label) and cannot contain spaces. Then, after the comma, an option is set: include=FALSE. This tells R to run this code but not to include the output in the final document.
You can embed an R code chunk like this:
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
The remainder of this document contains the chapter headings for the book, and an empty code chunk in each section to get you started. Try knitting this document now by clicking the “Knit” button in the RStudio toolbar, or choosing File > Knit Document from the RStudio menu.
c() is a function, where c is short for “combine” or “concatenate”. It takes a sequence of comma-separated elements in parentheses and joins them into a vector in which each element is still individually accessible.
c(1, 2, 3, 1, 3, 5, 25)
## [1] 1 2 3 1 3 5 25
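As a small sketch of what "individually accessible" means, the block below pulls elements out of the same vector by position. The variable names `second`, `first_and_last` and `n_elements` are made up for illustration.

```r
# Build a vector with c() and pull out individual elements by position.
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

second <- my_numbers[2]                # R counts positions from 1
first_and_last <- my_numbers[c(1, 7)]  # index with another vector
n_elements <- length(my_numbers)       # how many elements the vector holds

print(second)
print(first_and_last)
print(n_elements)
```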
We can assign this to a variable. Use Alt + - to produce the assignment symbol <-.
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)
Type in the variable name to see the assigned object.
my_numbers
## [1] 1 2 3 1 3 5 25
We can pass these numbers as an argument to functions.
mean() finds the mean of a set of numbers.
mean(my_numbers)
## [1] 5.714286
You can assign the result of a function to a variable and output it.
summary() gives some summary statistics of the numbers.
my_summary <- summary(my_numbers)
my_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.500 3.000 5.714 4.000 25.000
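To see where the numbers in the summary come from, the sketch below recomputes the quartiles with quantile() and the mean by hand. It assumes R's default quantile method; `quartiles` and `average` are illustrative names.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

# The 1st quartile, median and 3rd quartile, using R's default quantile method.
quartiles <- quantile(my_numbers, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# The mean is the sum of the elements divided by how many there are.
average <- sum(my_numbers) / length(my_numbers)
print(average)
```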
table() provides a count of each element.
table(my_numbers)
## my_numbers
## 1 2 3 5 25
## 2 1 2 1 1
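The result of table() can itself be stored and queried. A minimal sketch, with `counts`, `values_seen` and `ones` as illustrative names:

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
counts <- table(my_numbers)

values_seen <- names(counts)  # the distinct values, stored as character strings
ones <- counts[["1"]]         # how many times the value 1 occurs

print(values_seen)
print(ones)
```

Note that the names are character strings, so the count for the value 1 is looked up with "1" rather than 1.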
If we multiply a vector by a number, each element in that vector gets multiplied by that number.
my_numbers * 5
## [1] 5 10 15 5 15 25 125
If we add a number to a vector, that number is added to each element in turn.
my_numbers + 1
## [1] 2 3 4 2 4 6 26
If we add vectors of the same length (such as adding a vector to itself), each element in one vector is added to the corresponding element in the other vector.
my_numbers + my_numbers
## [1] 2 4 6 2 6 10 50
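The single-number cases above are special cases of a more general rule: when two vectors differ in length, R reuses ("recycles") the shorter one. A brief sketch with made-up vectors `long` and `short`:

```r
long <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is reused as 10, 20, 10, 20 to match the length of long.
recycled <- long + short
print(recycled)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.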
Every object has a class. Use the class() function to find the class of an object.
class(my_numbers)
## [1] "numeric"
class(my_summary)
## [1] "summaryDefault" "table"
class(summary)
## [1] "function"
Actions can change a class. Combining a character string with a numeric vector converts the whole vector to character, so the numbers are shown enclosed in quotes.
my_new_vector <- c(my_numbers, "Apple")
my_new_vector
## [1] "1" "2" "3" "1" "3" "5" "25" "Apple"
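The conversion is not undone automatically. As a sketch of what happens when we try to go back, the block below converts a small mixed vector to numeric again; `mixed` and `back` are illustrative names.

```r
mixed <- c(1, 2, "Apple")
print(class(mixed))   # the whole vector is now character

# Converting back works for the strings that look like numbers;
# "Apple" cannot be converted and becomes NA (with a warning, suppressed here).
back <- suppressWarnings(as.numeric(mixed))
print(back)
```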
The most common type of data object in R is a data frame: a rectangular table made up of rows (of observations) and columns (of variables).
Here is a small dataset from the socviz library:
titanic
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
class(titanic)
## [1] "data.frame"
The $ operator allows you to pick out a named column of a data frame:
titanic$percent
## [1] 62.0 5.7 16.7 15.6
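The $ operator is one of several equivalent ways to extract a column. The sketch below uses `titanic_copy`, a hand-typed stand-in for the socviz table printed above, so that it runs without the package.

```r
# A hand-typed copy of the table shown above (not loaded from socviz).
titanic_copy <- data.frame(
  fate = c("perished", "perished", "survived", "survived"),
  sex = c("male", "female", "male", "female"),
  n = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

by_dollar <- titanic_copy$percent        # by name, with $
by_name <- titanic_copy[["percent"]]     # by name, with [[ ]]
by_position <- titanic_copy[, 4]         # by column position

print(by_dollar)
```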
A tibble is an augmented data frame. We can convert a data frame to a tibble:
titanic.tb <- as_tibble(titanic)
titanic.tb
## # A tibble: 4 x 4
## fate sex n percent
## <fct> <fct> <dbl> <dbl>
## 1 perished male 1364 62
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
The str() function lets you see inside an object.
Objects can be simple…
str(my_numbers)
## num [1:7] 1 2 3 1 3 5 25
…or objects can be more complicated, although they are usually organized collections of simpler objects.
str(my_summary)
## 'summaryDefault' Named num [1:6] 1 1.5 3 5.71 4 ...
## - attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
In ggplot, we will build up plots a piece at a time by adding expressions to one another. When doing this, make sure your + character goes at the end of the line, like this…
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point()
…not like this:
ggplot(data = mpg, aes(x = displ, y = hwy))
+ geom_point()
Use the read_csv() function to read in comma-separated data.
Give the function a URL and it will fetch the data. A message will be printed at the console, telling us which class has been assigned to each column of the object it has created.
url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"
organs <- read_csv(file = url)
## Parsed with column specification:
## cols(
## .default = col_double(),
## country = col_character(),
## world = col_character(),
## opt = col_character(),
## consent.law = col_character(),
## consent.practice = col_character(),
## consistent = col_character(),
## ccode = col_character()
## )
## See spec(...) for full column specifications.
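read_csv() works the same way on a local file. The sketch below writes a tiny, made-up CSV to a temporary file and reads it back, declaring the column types up front instead of letting readr guess them. The file contents and the names `csv_path` and `toy` are invented for illustration.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example needs no network.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1.5", "b,2"), csv_path)

# Declaring col_types replaces the guessing step and its console message.
toy <- read_csv(csv_path,
                col_types = cols(name = col_character(),
                                 score = col_double()))
print(toy)
```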
Let us make a scatterplot of the gapminder data.
Let’s take a look at the data first.
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
We will make a scatterplot of lifeExp (life expectancy) against gdpPercap (GDP per capita).
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()
At the end of Chapter 2, we plotted a graph using ggplot(). The steps are always the same.
Data
First we tell ggplot() what data we are using, using the data argument:
p <- ggplot(data = gapminder)
Aesthetic mappings
Second, we tell ggplot() which variables in the data should be mapped to visual elements in the plot, using the mapping argument.
It is passed the aes() function (for aesthetics).
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp))
This says that the x variable on the x-axis will be gdpPercap and the y variable on the y-axis will be lifeExp.
But by this point, we don’t yet have a graph…
p
Type of plot
Add a layer specifying the type of plot you want by picking a geom_ function.
We will use geom_point() to plot the x and y values as a scatterplot:
p + geom_point()
Plots are built up by adding layers one at a time. It really is an additive process.
Let’s try a different geom_ function.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This creates a smoothed line and adds a shaded ribbon showing the standard error of the line.
If we want to see the data points and the line together, we simply add geom_point() back in.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Notice in the console message that geom_smooth() is using method = gam (for generalized additive model).
We could instead use method = lm for a linear model.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method = "lm")
We haven’t had to tell geom_point() or geom_smooth() where to get their data from. They inherit it from the p object.
Looking at our data, it is all bunched up against the left hand side. The scale would look better if it was transformed from a linear scale to a log scale.
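A rough sketch of why a log scale spreads the points out: values that grow by a constant factor are unevenly spaced on a linear axis but evenly spaced once logged. The numbers in `gdp_values` are made up for illustration.

```r
gdp_values <- c(100, 1000, 10000, 100000)

linear_gaps <- diff(gdp_values)      # spacing on a linear axis
log_gaps <- diff(log10(gdp_values))  # spacing on a log10 axis

print(linear_gaps)
print(log_gaps)
```

On the linear axis each gap is ten times the previous one, which is what pushes most of the data against the left-hand side.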
Add the scale_x_log10() function to p:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method = "gam") + scale_x_log10()
Notice the scientific notation used on the x-axis now that we are using a log scale. Let’s change to a sensible scale and use $ values (the unit of GDP per capita).
We will use the scales package’s dollar() function. Rather than loading the whole package with library(), grab the function from it directly using the syntax thepackage::thefunction.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar)
We can reformat the text under the tick marks using other labels functions.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::comma)
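The labelling functions can also be called directly on a vector of numbers, which is a quick way to preview what they will put under the tick marks. A minimal sketch, assuming the scales package is installed; `tick_values` is an illustrative name.

```r
tick_values <- c(1000, 10000)

dollar_labels <- scales::dollar(tick_values)
comma_labels <- scales::comma(tick_values)

print(dollar_labels)
print(comma_labels)
```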
An aesthetic mapping specifies that a variable will be expressed by one of the available visual elements, such as size, or colour, or shape.
We map variables to aesthetics like this:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = continent))
This code does not give a direct instruction like “colour the points purple”. Instead it says, “the property colour will represent the variable continent” or “colour will map continent”.
Let’s see what this looks like:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = continent))
p + geom_point() + scale_x_log10(labels = scales::dollar)
Different colours are used to represent points with different continent properties.
If we want to turn all the points in the figure purple, we do not do it through the mapping function. Look at what happens when we try:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = "purple"))
p + geom_point() + scale_x_log10(labels = scales::dollar)
What is going on?
mapping wants to map the property colour to a variable and assumes it will get a variable. We give it “purple”. So every row of data is assigned a categorical variable, purple, which has the value of “purple”. We have created a new column of data, every item of which is “purple”.
ggplot() maps this variable to the colour aesthetic, using its default first colour of red.
The aes() function is for mapping only, not for setting a property value. If we want to set a property value, we do it in the geom_ function, outside the mapping = aes(…) step.
Let’s try this. Set geom_point()’s colour property to “purple”:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(colour = "purple") + scale_x_log10(labels = scales::dollar)
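The difference between mapping and setting can be checked without looking at the plot. The sketch below uses ggplot2's layer_data() helper to inspect the colour actually assigned to each point; the data frame `toy` is made up for illustration.

```r
library(ggplot2)

toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))

# Mapping: "purple" is treated as a one-level categorical variable.
mapped <- ggplot(toy, aes(x = x, y = y, colour = "purple")) + geom_point()

# Setting: the colour is fixed directly on the geom.
fixed <- ggplot(toy, aes(x = x, y = y)) + geom_point(colour = "purple")

# layer_data() returns the data a layer will draw, including its colours.
mapped_colours <- unique(layer_data(mapped, 1)$colour)
fixed_colours <- unique(layer_data(fixed, 1)$colour)

print(mapped_colours)  # a colour picked by the default scale, not "purple"
print(fixed_colours)
```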
We can change the look by giving different arguments to the geom_ functions. alpha sets the transparency (0 fully transparent, 1 fully opaque). se is a boolean, which turns the standard error ribbon on and off.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) +
geom_smooth(colour = "orange", se = FALSE, size = 8, method = "lm") +
scale_x_log10(labels = scales::dollar)
The labs() function sets the axis labels of the plot, as well as its title, subtitle and caption.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder.")
Let colour map the continent:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
colour = continent))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
We have five smoothing lines and standard error ribbons, one for each continent. This is a consequence of the way aesthetic mappings are inherited. mapping = aes(...) is set in the call to ggplot() used to create the p object. geom_point() and geom_smooth() inherit from this.
We can set the shading of this standard error ribbon to match its dominant colour, using the fill property. Whereas colour affects the appearance of lines and points, fill is for the filled areas of bars, polygons and the interiors of the smoother’s standard error ribbon.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
colour = continent, fill = continent))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
Perhaps five separate smoothers is too many, and we just want one line. But we would still like to have the points colour-coded by continent.
By default, geoms inherit their mappings from the ggplot() function. We will map x and y in the ggplot() function as usual, which will be inherited by the geom_ functions. We will then use mapping = aes(colour = continent) only in geom_point(). This ensures that the points are colour-coded by continent, but geom_smooth() will only plot one line, as it does not map continent in any way.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(colour = continent)) +
geom_smooth(method = "loess") +
scale_x_log10()
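One way to confirm how inheritance changes the smoother is to count the groups it was fitted to. This sketch uses the mpg data that ships with ggplot2 rather than gapminder, with drv (three levels) standing in for continent; method = "lm" is used only to keep the example quick.

```r
library(ggplot2)

# colour is mapped in ggplot(), so both layers inherit it.
inherited <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# colour is mapped only in geom_point(), so the smoother does not see it.
local_only <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smoother; count how many separate groups it was fitted to.
n_inherited <- length(unique(layer_data(inherited, 2)$group))
n_local <- length(unique(layer_data(local_only, 2)$group))

print(n_inherited)
print(n_local)
```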
It is possible to map continuous variables to the colour aesthetic. We can map the log of each country-year’s population, pop, to colour. (We can take the log of population right in the aes() statement using the log() function). When we do this, ggplot produces a gradient scale.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(colour = log(pop))) + scale_x_log10()
The gradient scale of the colour is continuous but is marked at intervals in the legend. Depending on the circumstances, mapping quantities like population to a continuous colour gradient may be more or less effective than cutting the variable into categorical bins.
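As a sketch of the binning alternative, base R's cut() turns a continuous variable into an ordered set of categories. The population values, break points and labels below are made up for illustration.

```r
pop <- c(5e5, 5e6, 5e7, 5e8)  # made-up populations

# Each value is assigned to the interval it falls in.
pop_bins <- cut(pop,
                breaks = c(0, 1e6, 1e7, 1e8, Inf),
                labels = c("<1M", "1-10M", "10-100M", ">100M"))
print(pop_bins)
```

Mapping colour to a variable like `pop_bins` would produce a discrete legend with one colour per bin instead of a gradient.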
We can set the default size of plots within our .Rmd document. This command tells R to make 8 x 5 figures:
knitr::opts_chunk$set(fig.width = 8, fig.height = 5)
We can change the size of a particular plot by setting the same options inside the curly braces at the start of its chunk.
A figure can be saved to a file using the ggsave() function. To save the most recently displayed figure, provide the name we want to save it under:
ggsave(filename = "my_figure.png")
## Saving 8 x 5 in image
This will save the figure as a PNG file. If we want a PDF file instead, change the extension of the file:
ggsave(filename = "my_figure.pdf")
## Saving 8 x 5 in image
We do not need to write filename = as long as the name of the file is the first argument in ggsave(). We can also pass plot objects to ggsave(). For example, we can put our most recent plot into an object called p_out and then tell ggsave() we want to save that object.
p_out <- p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
ggsave("other_figure.pdf", plot = p_out)
## Saving 8 x 5 in image
When saving your work, it is useful to have one or more subfolders where you save only figures. You should also take care to name your saved figures in a sensible way: fig_1.pdf or my_figure.pdf are not good names.
Create a folder named figures in your project folder, and load the here library. The here() function returns the path to the project folder, in this case “C:/Users/Steven/Documents/Data Visualization - A Practical Introduction”.
here()
## [1] "C:/Users/Steven/Documents/Data Visualization - A Practical Introduction"
We can then use the here() function to make saving our work much easier. Assuming a folder named figures exists in the project folder, we can do this:
ggsave(here("figures", "here_figure.pdf"), plot = p_out)
## Saving 8 x 5 in image
In general, you should save your work in several different formats and in different sizes. You can use scale, or set the height and width (and units) explicitly.
ggsave(here("figures", "sized_figure.pdf"), plot = p_out,
height = 8, width = 10, units = "in")
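The calls above assume the figures folder already exists. A self-contained sketch of the whole sequence is below; it uses a temporary directory in place of here() so that it runs anywhere, and the plot and file name are placeholders.

```r
library(ggplot2)

# A placeholder plot built from the mpg data that ships with ggplot2.
p_out <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Create the figures folder if it is not there yet.
# tempdir() stands in for the project folder that here() would return.
fig_dir <- file.path(tempdir(), "figures")
dir.create(fig_dir, showWarnings = FALSE)

# Save with an explicit size so the result does not depend on the plot window.
out_file <- file.path(fig_dir, "displ_vs_hwy.pdf")
ggsave(out_file, plot = p_out, width = 10, height = 8, units = "in")

print(file.exists(out_file))
```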
What happens when you put the geom_smooth() function before geom_point()…
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_smooth() + geom_point()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
…instead of after it?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
If geom_smooth() comes before geom_point(), then the smoothed curve is drawn beneath the points and is obscured by them.
If geom_point() comes before geom_smooth(), then the smoothed curve will be plotted over the points, obscuring them.
Plots are built layer by layer, with the later layers being placed on top.
Change the mappings in the aes() function so that you plot life expectancy against population (pop) rather than per capita GDP.
p <- ggplot(data = gapminder, mapping = aes(x = pop, y = lifeExp))
p + geom_point()
Each point represents a country-year. The units of each point are taken from the data frame: lifeExp is in years and pop is the number of people.
The majority of the points are bunched on the left, the countries having relatively low populations.
There are then around 25 points further to the right, where lifeExp (starting from a low base) seems to increase with pop. This is the case of China, which starts with low life expectancy which then grows, along with its population.
Try some alternative scale mappings. Besides scale_x_log10(), you can try scale_x_sqrt() and scale_x_reverse(). There are corresponding functions for y-axis transformations. Just write y instead of x.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
scale_x_sqrt() +
labs(title = "scale_x_sqrt()")
The points are scaled as if each x value is square rooted. The lower values appear more stretched out, and larger values are compressed.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
scale_y_sqrt() +
labs(title = "scale_y_sqrt()")
The points are scaled as if each y value is square rooted. The lower values appear more stretched out, and larger values are compressed.
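A small numeric sketch of the square-root effect: the made-up values below get further and further apart on a linear axis, but end up evenly spaced after the transformation, so the large values are pulled together relative to the small ones.

```r
x_values <- c(0, 100, 400, 900, 1600)

raw_gaps <- diff(x_values)         # spacing on a linear axis
sqrt_gaps <- diff(sqrt(x_values))  # spacing on a square-root axis

print(raw_gaps)
print(sqrt_gaps)
```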
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
scale_x_reverse() +
labs(title = "scale_x_reverse()")
The x values are reversed, so points go right-to-left.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
scale_y_reverse() +
labs(title = "scale_y_reverse()")
The y values are reversed, so values increase from top to bottom.
What happens if you map colour to year instead of continent?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = year))
p + geom_point()
The year variable is a number. Higher numbers (later years) are given a lighter colour than lower numbers (earlier years), as seen in the colour scale in the legend. Countries generally seem to have life expectancy and GDP per capita increasing as the years progress.
Instead of mapping colour = year, what happens if you try colour = factor(year)?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = factor(year)))
p + geom_point()
factor() makes the year value categorical. Each year is a discrete category and gets its own colour.
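What factor() does to the variable can be seen outside of a plot. In this sketch `year` is a short made-up vector rather than the gapminder column.

```r
year <- c(1952, 1957, 1952, 2007)
year_f <- factor(year)

print(is.numeric(year))    # the original is a number
print(is.numeric(year_f))  # the factor is not
print(levels(year_f))      # one level per distinct year
```

Because a factor has a fixed set of levels rather than a numeric range, ggplot gives it a discrete colour scale instead of a gradient.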
What might be a better visualization of our data, that does not ignore its temporal and country-level structure?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(colour = country, alpha = year)) +
geom_smooth(method = "lm") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder.") +
theme(legend.position = "none")
Each country is mapped to a colour and the year is mapped to alpha, so later years are more opaque. I have removed the legend because there are so many countries, but you would need it to see which colour represents each country.
Beginning with the gapminder dataset, imagine we wanted to see how GDP per capita changes over time. Let’s plot year and gdpPercap for each country-year:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_point()
Imagine we wanted to plot the trajectory of GDP per capita for each country by joining them with lines:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line()
This is not what we wanted. Points for each year have been joined, whereas we wanted points for each country to be joined.
Use the group aesthetic to tell ggplot explicitly about the country-level structure:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country))
The lines now show the trajectory of GDP per capita of a country over time.
We have used the group aesthetic because the grouping we wanted (country) was not built into the variables being mapped (year and gdpPercap). There is no information in the year variable itself to let ggplot know that it is grouped by country.
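The effect of the group aesthetic can be checked by counting how many separate lines ggplot intends to draw. This sketch uses a made-up two-country data frame, `toy`, and ggplot2's layer_data() helper.

```r
library(ggplot2)

toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year = rep(c(2000, 2005, 2010), times = 2),
  gdp = c(1, 2, 3, 10, 12, 15)
)

ungrouped <- ggplot(toy, aes(x = year, y = gdp)) + geom_line()
grouped <- ggplot(toy, aes(x = year, y = gdp)) +
  geom_line(aes(group = country))

# Each distinct value of group in the layer data becomes one line.
n_lines_ungrouped <- length(unique(layer_data(ungrouped, 1)$group))
n_lines_grouped <- length(unique(layer_data(grouped, 1)$group))

print(n_lines_ungrouped)
print(n_lines_grouped)
```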
The previous plot is very messy. One option is to facet the data by some third variable, making a “small multiple” plot.
A separate panel is drawn for each value of the faceting variable. Facets are not a geom, but a way of organising a series of geoms. In this case, we will use facet_wrap() to split the plot by continent. Pass continent as an argument with a ~.
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country)) +
facet_wrap(~continent)
We can add a smoother and some cosmetic enhancements.
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(colour = "gray70", aes(group = country)) +
geom_smooth(size = 1.1, method = "loess", se = FALSE) +
scale_y_log10(labels=scales::dollar) +
facet_wrap(~continent, ncol = 5) +
labs(x = "Year",
y = "GDP per capita",
title = "GDP per capita on Five Continents")
Data can be cross-classified by two categorical variables by using facet_grid(). We will use the gss_sm dataset, which is a small subset of the questions from the 2016 General Social Survey. Unlike the gapminder dataset, it contains many categorical variables.
We wish to make a smoothed scatterplot of the relationship between the age of the respondent (age) and the number of children they have (childs). We will facet this relationship by sex and race, using facet_grid(sex ~ race):
p <- ggplot(data = gss_sm,
mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) +
geom_smooth() +
facet_grid(sex ~ race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
Further categorical variables can be added, such as facet_grid(sex ~ race + degree), which will have a row for each value of sex and a column for each combination of race and degree.
Whereas geom_point() just plots a point with given x and y coordinates, other geoms transform data before they are plotted (think of the smoother created by geom_smooth()).
Every geom_ function has an associated stat_ function and vice versa.
Sometimes the calculations done by the stat_ functions are not immediately obvious. Consider geom_bar():
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar()
The bar height gives the count of observations from each region of the USA. There is a y-axis variable, called count, that is not in the data but has been calculated for us. Behind the scenes, geom_bar() calls its default stat_ function, stat_count(). The function computes two new variables, count and prop (short for proportion). If we want to use prop in our bar chart, it must be used as a mapping. The relevant argument is ..prop.. (it begins and ends with two periods so that it won’t be confused with a prop variable that may already exist in our data). Use:
mapping = aes(<variable> = <..statistic..>)
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))
This is not what we want. The data is being grouped by the x-categories, whereas we want the whole data to be grouped, so each bar represents the proportion of the whole data. We do this using group = 1 inside the aes() call (1 being a “dummy group” representing the whole dataset).
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))
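The two behaviours can be compared numerically on a small example. The sketch below uses a made-up data frame, `toy`, and writes the statistic as after_stat(prop), the spelling that ggplot2 3.3.0 and later use in place of ..prop..; with an older ggplot2, substitute ..prop.. as in the text.

```r
library(ggplot2)

toy <- data.frame(region = c("North", "North", "South", "West"))

# Without a group, each bar is its own group, so each proportion is 1.
per_bar <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(y = after_stat(prop)))

# With group = 1, proportions are taken over the whole dataset.
overall <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

props_per_bar <- layer_data(per_bar, 1)$y
props_overall <- layer_data(overall, 1)$y

print(props_per_bar)
print(props_overall)
```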
The gss_sm data contains a religion variable. Let’s graph this as a bar chart, with a colour for each religion:
p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill = FALSE)
Both x and fill are mapped to religion. Because we do not need a legend showing which colour represents each religion (we can read that from the x-axis), we turn off the legend using guides(fill = FALSE).
A more appropriate use of the fill aesthetic with geom_bar() is to cross-classify two categorical variables, for example to examine the distribution of religious preferences within different regions of the United States. (Note: This is not the most straightforward way of producing these bar charts. We will see a better way in the next chapter, where we calculate a table first.)
Let’s look at the breakdown of religion by region; that is, we want the religion variable broken down proportionally within bigregion. Map fill to religion:
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()
This is a stacked bar chart, where the counts of each religion are stacked within each bar. A problem with this is that the segments for a given religion do not share a common baseline, so it is hard to compare their heights across regions.
An alternative is to set the position = "fill" argument in the geom_ function:
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")
This makes it easier to compare the proportion of each religion in a region, but we lose the relative sizes of the regions.
We could also set position = "dodge" to place the religions side by side in each region…
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge")
…but this shows counts, not proportions. We have already seen that mapping y = ..prop.. gives proportions…
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop..))
…but this is not useful. The problem is we are seeing the proportion of Protestants that are Protestant (and Catholics are Catholic etc.) in each region, which is 100% in each case. Previously we fixed this using the group argument. If we set group = religion…
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop.., group = religion))
…then we see what proportion of Protestants live in each region (and Catholics etc). So we see that nearly half of all Protestants live in the South. The bars for each religion sum to one across the regions, but the bars do not sum to one in each region.
The easiest thing to do is to stop trying to force geom_bar() to do all the work in a single step. Instead, we ask ggplot to give us a proportional bar chart of religious affiliation, and then facet that by region.
p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(mapping = aes(y = ..prop.., group = bigregion)) + facet_wrap(~bigregion)
Note: in this case we group by bigregion, not religion. Otherwise we will just find that 100% of Protestants in a region are Protestant.
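The two kinds of proportion discussed in this section correspond to the two margins of a cross-tabulation. A base R sketch with made-up region and religion labels:

```r
region <- c("North", "North", "North", "South", "South", "South", "South")
religion <- c("A", "A", "B", "A", "B", "B", "B")

tab <- table(region, religion)

# Proportions of each religion within a region: every row sums to 1.
within_region <- prop.table(tab, margin = 1)

# Proportions of each region within a religion: every column sums to 1.
within_religion <- prop.table(tab, margin = 2)

print(within_region)
print(within_religion)
```

Grouping by bigregion in the faceted plot corresponds to the first table; grouping by religion in the dodged plot corresponds to the second.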
A histogram is a way of summarizing a continuous variable by chopping it up into segments or “bins” and counting how many observations are found within each bin. We have to decide how finely to bin the data.
The midwest dataset contains information on counties in several midwestern states of the United States. Counties vary in size, so we can make a histogram showing the distribution of their geographical areas (measured in square miles).
We need to divide the observations into bins. geom_histogram() will choose a bin size for us based on a rule of thumb. The histogram displays a count of observations in each bin.
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can set the number of bins using bins, or the width of each bin using binwidth:
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram(bins = 10)
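The binwidth alternative works the same way. In this sketch the value 0.01 is an arbitrary choice made for illustration, and layer_data() is used only to confirm that every county ends up in some bin:

```r
library(ggplot2)

# Same plot as above, but fixing the width of each bin instead of the number of bins.
# binwidth is expressed in the units of the x variable; 0.01 is an arbitrary example value.
p <- ggplot(data = midwest, mapping = aes(x = area))
p_binwidth <- p + geom_histogram(binwidth = 0.01)

# Inspect the counts that the stat computed for the first (and only) layer
bin_counts <- layer_data(p_binwidth)$count
print(bin_counts)
```

Setting bins and setting binwidth are two routes to the same decision, so in practice you would normally supply one or the other, not both.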
As with bar charts, a new count variable is computed for each bin and displayed.
We can display multiple histograms in one plot. Let’s subset our data so we only look at counties in two states: Ohio (“OH”) and Wisconsin (“WI”), giving them different fills. Use the subset() function.
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi), mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)
An alternative to a histogram is to calculate a kernel density estimate of the underlying distribution. The geom_density() function will do this:
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_density()
A similar plot can be achieved using geom_line(stat = "density"), which removes the lines at the sides and bottom of the area (and doesn’t allow a fill):
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_line(stat = "density")
We can use colour and fill for geom_density() too. We could map each state to a different colour and fill, allowing us to see them in one plot.
p <- ggplot(data = midwest, mapping = aes(x = area, colour = state, fill = state))
p + geom_density(alpha = 0.3)
When using geom_bar(), we saw we could use the ..prop.. statistic for a proportional measure instead of the ..count.. statistic. We can do something similar with geom_histogram() and geom_density() using their stat_ functions.
From geom_density(), the stat_density() function can return its default ..density.. statistic, or ..scaled.., which will give a proportional density estimate.
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
mapping = aes(x = area, fill = state, colour = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..scaled..))
It can also return a statistic called ..count.., which is the density times the number of points. This can be used in stacked density plots.
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
mapping = aes(x = area, fill = state, colour = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..count..))
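To see how these statistics relate to one another, here is a rough base R sketch. It uses simulated data and base R's density() function, and it assumes the scaled version is the density divided by its maximum. ggplot2's own defaults may differ in detail, so treat it as an illustration rather than a reproduction of stat_density():

```r
# Simulated data, for illustration only
set.seed(1)
x <- rnorm(200)

# An ordinary kernel density estimate
d <- density(x)

# Analogue of ..scaled..: rescale so that the peak of the curve is 1
scaled <- d$y / max(d$y)

# Analogue of ..count..: the density multiplied by the number of observations
count <- d$y * length(x)

print(range(scaled))
```

Because the count version is weighted by group size, it is the one that makes sense to stack when the groups contain different numbers of observations.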
Sometimes data will already have counts or proportions in them…
titanic
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
…so we do not need a stat_ function to calculate these things for us. Use stat = "identity" in the geom_ function. (We’ll also move the legend to the top of the plot):
p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent, fill = sex))
p + geom_bar(position = "dodge",
stat = "identity") + theme(legend.position = "top")
geom_col() has the same effect as using geom_bar(stat = "identity").
We can also use position = "identity" to plot the values as given. This lets us plot a flow of positive and negative values in a bar chart. This is useful when looking at changes relative to some threshold level or baseline. The oecd_sum table in socviz contains information on average life expectancy at birth within the USA and other OECD countries.
oecd_sum
## Warning: Detecting old grouped_df format, replacing `vars` attribute by
## `groups`
## # A tibble: 57 x 5
## # Groups: year [57]
## year other usa diff hi_lo
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1960 68.6 69.9 1.3 Below
## 2 1961 69.2 70.4 1.2 Below
## 3 1962 68.9 70.2 1.30 Below
## 4 1963 69.1 70 0.9 Below
## 5 1964 69.5 70.3 0.800 Below
## 6 1965 69.6 70.3 0.7 Below
## 7 1966 69.9 70.3 0.400 Below
## 8 1967 70.1 70.7 0.6 Below
## 9 1968 70.1 70.4 0.3 Below
## 10 1969 70.1 70.6 0.5 Below
## # ... with 47 more rows
The other column is the average life expectancy in a given year for countries excluding the United States. The usa column is the U.S. life expectancy. diff is the difference between the two values and hi_lo indicates whether the U.S. value is higher or lower than the OECD average.
We will plot the difference over time and use the hi_lo variable to colour the columns in the chart.
p <- ggplot(data = oecd_sum, mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() + guides(fill = "none") +
labs(x = NULL, y = "Difference in Years",
title = "The US Life Expectancy Gap",
subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 27th 2017.")
## Warning: Removed 1 rows containing missing values (position_stack).
Revisit the gapminder plots at the beginning of the chapter and experiment with different ways to facet the data.
Try plotting population and per capita GDP while faceting on year
p <- ggplot(data = gapminder, mapping = aes(x = pop, y = gdpPercap))
p + geom_point(colour = "gray70") + geom_smooth(se = FALSE) + facet_wrap(~year)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Try plotting population and per capita GDP while faceting on country
p <- ggplot(data = gapminder, mapping = aes(x = pop, y = gdpPercap))
p + geom_point(colour = "gray70") + geom_smooth(se = FALSE) + facet_wrap(~country)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Investigate the difference between a formula written as facet_grid(sex ~ race) and one written as facet_grid(~ sex + race).
If we use facet_grid(sex ~ race)…
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) + geom_smooth() + facet_grid(sex ~ race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
…the facets break out the data into sex (rows) and race (columns).
If we use facet_grid(~ sex + race)…
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) + geom_smooth() + facet_grid(~ sex + race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
…the facets break out the data for each combination of sex and race.
Experiment to see what happens when you use facet_wrap() with more complex formulas like facet_wrap(~ sex + race) instead of facet_grid().
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) + geom_smooth() + facet_wrap(~ sex + race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
facet_wrap() breaks out the data for each combination of sex and race (as does facet_grid()) but lays out the results in a wrapped 1D table rather than a fully cross-classified grid.
Frequency polygons are closely related to histograms. Instead of displaying the count of observations using bars, they display it with a series of connected lines. Try the various geom_histogram() calls in this chapter using geom_freqpoly() instead.
Default geom_freqpoly()
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Define the Number of Bins
We can still define bins and binwidth…
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_freqpoly(bins = 10)
Subset the Data
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi), mapping = aes(x = percollege))
p + geom_freqpoly(bins = 20)
A histogram bins observations for one variable and shows a bar with the count in each bin. We can do this for two variables at once, too. The geom_bin2d() function takes two mappings, x and y. It divides the plot into a grid and colours the bins by the count of observations in them. Try it on the gapminder data, plotting life expectancy against per capita GDP.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_bin2d()
The size of the bins needs to be set in 2 dimensions now, e.g. bins = c(20,50).
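A short sketch of the two-dimensional bins argument follows. To keep it self-contained it uses R's built-in faithful dataset as a stand-in for gapminder, and the bin numbers are simply the example values mentioned above:

```r
library(ggplot2)

# faithful ships with R; it is used here only as a stand-in dataset
p <- ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting))

# A length-two vector sets the number of bins in each of the two directions
p_binned <- p + geom_bin2d(bins = c(20, 50))

print(p_binned)
```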
Density estimates can also be drawn in two dimensions. The geom_density_2d() function draws contour lines estimating the joint distribution of two variables.
Try it with the midwest data, plotting percent below the poverty line (percbelowpoverty) against percent college-educated (percollege).
Try it with a geom_point() layer…
p <- ggplot(data = midwest, mapping = aes(x = percollege, y = percbelowpoverty))
p + geom_density_2d() + geom_point()
…and without a geom_point() layer
p <- ggplot(data = midwest, mapping = aes(x = percollege, y = percbelowpoverty))
p + geom_density_2d()
In the previous chapter, we attempted to break down respondents in a survey by region and religion. We saw that we could see what proportion of each region was from each religion, and what proportion of each religion came from each region (e.g. 60% of people in the South are Protestant vs 50% of all Protestants live in the South).
We did this within the mapping functions and found it to be opaque and confusing.
A better strategy is to construct the frequency table you want first and then create plots from it.
We will use the tools provided by dplyr, a component of the tidyverse, to do this. We will use the pipe operator, %>%.
Think of the %>% operator as allowing us to start with a data frame and perform a sequence or pipeline of operations to turn it into another, usually smaller and more aggregated, table. Data goes in one side of the pipe, actions are performed via functions, and results come out of the other side.
A pipeline is typically a series of operations that do one or more of four things.
Group the data into the nested structure we want for our summary, such as “Religion by Region” or “Authors by Publications by Year”. Use group_by()
Filter or select pieces of data by row, column, or both. This gives us the piece of the table we want to work with. Use filter() for rows and select() for columns
Mutate the data by creating new variables at the current level of grouping. This adds new columns to the table without aggregating it. Use mutate()
Summarize or aggregate the grouped data. This creates new variables at a higher level of grouping. For example, we might calculate means with mean() or counts with n(). Use summarize()
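As a small sketch, the four kinds of operation can be chained together like this. The toy data frame and its values are invented for illustration and are not part of the GSS data:

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  region   = c("North", "North", "North", "South", "South", "South"),
  religion = c("A", "A", "B", "A", "B", "B"),
  age      = c(30, 45, 52, 28, 61, 39)
)

toy_summary <- toy %>%
  filter(age >= 30) %>%            # filter: keep only some rows
  select(region, religion) %>%     # select: keep only some columns
  group_by(region, religion) %>%   # group: religion nested within region
  summarize(N = n()) %>%           # summarize: one row per region-religion pair
  mutate(freq = N / sum(N))        # mutate: proportion within each region

print(toy_summary)
```

After summarize() the innermost grouping level (religion) is dropped, so the sum(N) inside mutate() is computed within each region.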
Let’s group religion by region
We will create a new table called rel_by_region:
rel_by_region <- gss_sm %>%
group_by(bigregion, religion = fct_explicit_na(religion)) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq*100), 0))
The code reads:
Create a new object, rel_by_region. It will get the result of the following sequence of actions…
Group the rows by bigregion and, within that, by religion. I have used fct_explicit_na() to deal with missing values of religion.
Summarize this table to create a new, smaller table, with three columns: bigregion, religion, and a new summary variable, N, that is a count of the number of observations within each religious group for each region.
With this new table, use the N variable to calculate two new columns: the relative proportion (freq) and percentage (pct) for each religious category, still grouped by region. Round the results to the nearest percentage point.
The group_by() function sets up how the grouped or nested data will be processed within the summarize() step. Any function used to create a new variable within summarize(), such as mean() or sd() or n(), will be applied to the innermost grouping level first. Grouping levels are named from left to right within group_by() from outermost to innermost. So the function call summarize(N = n()) counts up the number of observations for each value of religion within bigregion and puts them in a new variable named N.
The mutate() step takes the N variable and uses it to create freq and pct, adding them as columns to the table (without changing the grouping level). So the frequency and percentage for each religion are still grouped by bigregion: it is the percentage of the religion within the bigregion that is calculated.
Notice that we can create and name new variables within summarize() and mutate(). We have created and named N, freq and pct. Not only that but freq is created, named and then used within mutate() - it is used to create pct.
Our pipeline has taken the gss_sm data frame, which has 2,867 rows and 32 columns, and transformed it into rel_by_region, a summary table with 24 rows and 5 columns that looks like this, in part:
rel_by_region
## # A tibble: 24 x 5
## # Groups: bigregion [4]
## bigregion religion N freq pct
## <fct> <fct> <int> <dbl> <dbl>
## 1 Northeast Protestant 158 0.324 32
## 2 Northeast Catholic 162 0.332 33
## 3 Northeast Jewish 27 0.0553 6
## 4 Northeast None 112 0.230 23
## 5 Northeast Other 28 0.0574 6
## 6 Northeast (Missing) 1 0.00205 0
## 7 Midwest Protestant 325 0.468 47
## 8 Midwest Catholic 172 0.247 25
## 9 Midwest Jewish 3 0.00432 0
## 10 Midwest None 157 0.226 23
## # ... with 14 more rows
A benefit of using dplyr is we can perform sanity checks on our code. For instance, the percentages of each religion should sum to 100 within each region, perhaps with a bit of a rounding error. We can quickly check this using a very short pipeline:
rel_by_region %>% group_by(bigregion) %>% summarize(total = sum(pct))
## # A tibble: 4 x 2
## bigregion total
## <fct> <dbl>
## 1 Northeast 100
## 2 Midwest 101
## 3 South 100
## 4 West 101
Now that we have percentage values in the table, we can use geom_col() rather than geom_bar().
p <- ggplot(data = rel_by_region, mapping = aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
labs(x = "Region", y = "Percent", fill = "Religion") +
theme(legend.position = "top")
This is still a bad figure - it’s too crowded. We can do better. As a rule, dodged charts can be more cleanly expressed as faceted plots. This removes the need for a legend and makes the chart simpler to read.
We will use coord_flip() to flip the chart so religion appears on the vertical axis and percentage on the horizontal (without swapping x and y in the mapping).
p <- ggplot(data = rel_by_region, mapping = aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
labs(x = NULL, y = "Percent", fill = "Religion") +
guides(fill = "none") +
coord_flip() +
facet_grid(~ bigregion)
We will use the organdata dataset. select() the first 6 columns and then choose rows at random using sample_n().
organdata %>% select(1:6) %>% sample_n(size = 10)
## # A tibble: 10 x 6
## country year donors pop pop_dens gdp
## <chr> <date> <dbl> <int> <dbl> <int>
## 1 Switzerland 1993-01-01 16.6 6938 16.8 25316
## 2 Denmark 2000-01-01 12.5 5340 12.4 28146
## 3 Ireland 1991-01-01 19 3534 5.03 13495
## 4 Ireland 1995-01-01 24.6 3609 5.14 17789
## 5 Italy 1999-01-01 13.7 57646 19.1 23729
## 6 Germany 1992-01-01 14.2 80625 22.6 19811
## 7 Canada 1994-01-01 13.9 29036 0.291 21428
## 8 Netherlands NA NA NA NA NA
## 9 Finland 1993-01-01 19.6 5066 1.50 17082
## 10 Austria 1993-01-01 26.2 7906 9.43 21119
Graph the data as a scatterplot of donors against year:
p <- ggplot(data = organdata, mapping = aes(x = year, y = donors))
p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).
Let’s use a line graph, grouped and faceted by country:
p <- ggplot(data = organdata, mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~ country)
## Warning: Removed 34 rows containing missing values (geom_path).
Let’s focus on the country-level variation without paying attention to the time trend.
We will draw a box and whisker plot with geom_boxplot(), which uses stat_boxplot() to calculate the necessary statistics. We will categorize by the country variable and summarize by the continuous donors variable.
p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))
p + geom_boxplot()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
The trouble with this plot is that the country labels are overlapping. We can use coord_flip() to flip the chart axes (without changing the mappings):
p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
It would be better to present our data in some meaningful order, for example, listing the countries from a high to a low average donation rate. Reorder the country variable by the mean of donors.
Use the reorder() function, passing two arguments. The first argument is the factor to reorder (country). The second argument is the variable we want to reorder it by (donors). There is an optional third argument, which is the summary statistic to order the second variable by (the mean is the default if no third argument is provided, but you could use median or sd).
However, mean() returns NA for any country that has missing values, so pass na.rm = TRUE as an extra argument to reorder(). It is handed on to mean(), which then drops the NA values. Because we are reordering the variable we are mapping to the x aesthetic, we will use the reorder() function at that point in the code:
p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = TRUE), y = donors))
p + geom_boxplot() + labs(x = NULL) + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
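To see what reorder() is doing on its own, here is a minimal base R sketch. The country labels and donor values are made up for illustration:

```r
# Toy data, invented for illustration only
country <- c("A", "A", "B", "B", "C", "C")
donors  <- c(10, NA, 30, 32, 20, 22)

# Reorder the categories by their mean, dropping missing values first.
# The extra na.rm argument is passed through to mean().
by_mean <- reorder(country, donors, na.rm = TRUE)
print(levels(by_mean))

# The optional FUN argument swaps in a different summary statistic
by_median <- reorder(country, donors, FUN = median, na.rm = TRUE)
print(levels(by_median))
```

The result is a factor whose levels run from the lowest to the highest summary value, which is the order ggplot then uses along the axis.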
A variant on the boxplot is the violin plot, using geom_violin():
p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = TRUE), y = donors))
p + geom_violin() + labs(x = NULL) + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_ydensity).
We can also use colour and fill aesthetics:
p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = TRUE), y = donors, fill = world))
p + geom_boxplot() + labs(x = NULL) + coord_flip() + theme(legend.position = "top")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
It is often useful to put the categories on the y-axis and then have the continuous variable plotted on the x-axis. If there are a relatively small number of observations within each category, we can skip (or supplement) the boxplots by showing the individual observations.
We will just show the observations using geom_point() rather than geom_boxplot() and using the colour aesthetic rather than the fill:
p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = TRUE), y = donors, colour = world))
p + geom_point() + labs(x = NULL) + coord_flip() + theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
There is a danger in geom_point() that there will be some overplotting of points. Use geom_jitter() to randomly nudge each point by a small amount, giving us a better sense of how many observations there are at different values.
p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = TRUE), y = donors, colour = world))
p + geom_jitter() + labs(x = NULL) + coord_flip() + theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
We can control the amount of jitter using height and width arguments to a position_jitter() function within the geom.
p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = TRUE), y = donors, colour = world))
p + geom_jitter(position = position_jitter(width = 0.15)) + labs(x = NULL) + coord_flip() + theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
A Cleveland dotplot summarizes a categorical variable with one point per category. We can use a summary statistic, such as the average donation rate.
We will use a dplyr pipeline to aggregate the larger country-year data frame to a smaller frame of summary statistics by country.
There are multiple ways to do this. We could choose the variables we want to summarize and then repeatedly use the mean() and sd() functions to calculate the mean and standard deviations of the variables we want.
by_country <- organdata %>% group_by(consent_law, country) %>%
summarize(donors_mean = mean(donors, na.rm = TRUE),
donors_sd = sd(donors, na.rm = TRUE),
gdp_mean = mean(gdp, na.rm = TRUE),
health_mean = mean(health, na.rm = TRUE),
roads_mean = mean(roads, na.rm = TRUE),
cerebvas_mean = mean(cerebvas, na.rm = TRUE))
Here, we group the data by consent_law and, within that, by country. Then summarize() creates six new variables for the innermost grouping, that is, for each country within each consent_law.
The resulting object looks like this:
by_country
## # A tibble: 17 x 8
## # Groups: consent_law [2]
## consent_law country donors_mean donors_sd gdp_mean health_mean
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Informed Austra~ 10.6 1.14 22179. 1958.
## 2 Informed Canada 14.0 0.751 23711. 2272.
## 3 Informed Denmark 13.1 1.47 23722. 2054.
## 4 Informed Germany 13.0 0.611 22163. 2349.
## 5 Informed Ireland 19.8 2.48 20824. 1480.
## 6 Informed Nether~ 13.7 1.55 23013. 1993.
## 7 Informed United~ 13.5 0.775 21359. 1561.
## 8 Informed United~ 20.0 1.33 29212. 3988.
## 9 Presumed Austria 23.5 2.42 23876. 1875.
## 10 Presumed Belgium 21.9 1.94 22500. 1958.
## 11 Presumed Finland 18.4 1.53 21019. 1615.
## 12 Presumed France 16.8 1.60 22603. 2160.
## 13 Presumed Italy 11.1 4.28 21554. 1757
## 14 Presumed Norway 15.4 1.11 26448. 2217.
## 15 Presumed Spain 28.1 4.96 16933 1289.
## 16 Presumed Sweden 13.1 1.75 22415. 1951.
## 17 Presumed Switze~ 14.2 1.71 27233 2776.
## # ... with 2 more variables: roads_mean <dbl>, cerebvas_mean <dbl>
There was a lot of repetition in this code, using mean() and sd(), adding _mean and _sd suffixes. Here is a better way of doing it, which will summarize every numerical variable and name them in a consistent way:
by_country <- organdata %>% group_by(consent_law, country) %>%
summarize_if(is.numeric, list(mean = mean, sd = sd), na.rm = TRUE) %>%
ungroup()
summarize_if() examines each column and applies a test to it - in this case whether is.numeric() returns TRUE. If the test is passed, it applies each function in the named list, and the names (mean and sd) become the suffixes of the new columns. Finally we ungroup the data. Here is what the pipeline returns:
by_country
## # A tibble: 17 x 28
## consent_law country donors_mean pop_mean pop_dens_mean gdp_mean
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Informed Austra~ 10.6 18318. 0.237 22179.
## 2 Informed Canada 14.0 29608. 0.297 23711.
## 3 Informed Denmark 13.1 5257. 12.2 23722.
## 4 Informed Germany 13.0 80255. 22.5 22163.
## 5 Informed Ireland 19.8 3674. 5.23 20824.
## 6 Informed Nether~ 13.7 15548. 37.4 23013.
## 7 Informed United~ 13.5 58187. 24.0 21359.
## 8 Informed United~ 20.0 269330. 2.80 29212.
## 9 Presumed Austria 23.5 7927. 9.45 23876.
## 10 Presumed Belgium 21.9 10153. 30.7 22500.
## 11 Presumed Finland 18.4 5112. 1.51 21019.
## 12 Presumed France 16.8 58056. 10.5 22603.
## 13 Presumed Italy 11.1 57360. 19.0 21554.
## 14 Presumed Norway 15.4 4386. 1.35 26448.
## 15 Presumed Spain 28.1 39666. 7.84 16933
## 16 Presumed Sweden 13.1 8789. 1.95 22415.
## 17 Presumed Switze~ 14.2 7037. 17.0 27233
## # ... with 22 more variables: gdp_lag_mean <dbl>, health_mean <dbl>,
## # health_lag_mean <dbl>, pubhealth_mean <dbl>, roads_mean <dbl>,
## # cerebvas_mean <dbl>, assault_mean <dbl>, external_mean <dbl>,
## # txp_pop_mean <dbl>, donors_sd <dbl>, pop_sd <dbl>, pop_dens_sd <dbl>,
## # gdp_sd <dbl>, gdp_lag_sd <dbl>, health_sd <dbl>, health_lag_sd <dbl>,
## # pubhealth_sd <dbl>, roads_sd <dbl>, cerebvas_sd <dbl>,
## # assault_sd <dbl>, external_sd <dbl>, txp_pop_sd <dbl>
Let’s plot the average donation rate, donors_mean, on a Cleveland dotplot, mapping consent_law to colour:
p <- ggplot(data = by_country, mapping = aes(x = donors_mean, y = reorder(country, donors_mean), colour = consent_law))
p + geom_point(size = 3) +
labs(x = "Donor Procurement Rate",
y = "",
colour = "Consent Law") +
theme(legend.position = "top")
Alternatively, rather than colouring consent_law differently, we could facet by it. We will use facet_wrap(~consent_law), having a panel for each of the two consent laws (Informed and Presumed), where countries are ordered by average donation rate within each panel.
p <- ggplot(data = by_country, mapping = aes(x = donors_mean, y = reorder(country, donors_mean)))
p + geom_point(size = 3) +
facet_wrap(~consent_law) +
labs(x = "Donor Procurement Rate",
y = "")
There are a couple of wrinkles.
Firstly, by default, facet_wrap() will plot the panels side-by-side, making it harder to compare the donation rates (shown on the x-axis). Use the ncol = 1 argument in facet_wrap() to show them in one column, one panel above the other.
Secondly, the y-axis will show all countries on both panels, even though some belong in one panel and some in the other. In that case, only some rows will have points and others will have blanks. Use the scales = "free_y" argument in facet_wrap(). This is only sensible for categorical variables, not continuous ones.
The result of these changes looks like this:
p <- ggplot(data = by_country, mapping = aes(x = donors_mean, y = reorder(country, donors_mean)))
p + geom_point(size = 3) +
facet_wrap(~consent_law, scales = "free_y", ncol = 1) +
labs(x = "Donor Procurement Rate",
y = "")
Cleveland dotplots are generally preferred to bar or column charts. When making them, put the categories on the y-axis and order them in the way that is most relevant to the numerical summary you are providing.
This sort of plot is also an excellent way to summarize model results or any data with error ranges. Using geom_pointrange(), we can show a point estimate and a range around it. Here we will use the standard deviation of the donation rate that we have already calculated. This is also a good way to represent model coefficients with confidence intervals.
With geom_pointrange(), we map our x and y variables as usual. We need to tell it the range of the line to draw on either side of the point, defined by the arguments ymax and ymin. This is given by the y value (donors_mean) plus or minus its standard deviation (donors_sd). The function expects a number, but it is OK to give it a mathematical expression that resolves to a number.
p <- ggplot(data = by_country, mapping = aes(x = reorder(country, donors_mean), y = donors_mean))
p + geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd, ymax = donors_mean + donors_sd)) +
labs(x = "", y = "Donor Procurement Rate") + coord_flip()
Because geom_pointrange() expects y, ymin and ymax, we map donors_mean to y then flip the axes using coord_flip().
It can sometimes be useful to plot the labels along with the points in a scatterplot, or just plot informative labels directly. We can do this with geom_text().
p <- ggplot(data = by_country, mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country))
Unfortunately, the label appears over each point as they are positioned using the same x and y mapping. We could drop the points by removing geom_point().
Or we could pass the hjust argument to geom_text() to left-justify (hjust = 0) or right-justify (hjust = 1) the label.
p <- ggplot(data = by_country, mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country), hjust = 0)
Instead of using geom_text(), we will use ggrepel which provides geom_text_repel() and geom_label_repel().
Install and load the library in the usual way:
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 3.5.3
We will use geom_text_repel() along with the elections_historic dataset, which contains historical U.S. presidential election data.
elections_historic %>% select(2:7)
## # A tibble: 49 x 6
## year winner win_party ec_pct popular_pct popular_margin
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1824 John Quincy Adams D.-R. 0.322 0.309 -0.104
## 2 1828 Andrew Jackson Dem. 0.682 0.559 0.122
## 3 1832 Andrew Jackson Dem. 0.766 0.547 0.178
## 4 1836 Martin Van Buren Dem. 0.578 0.508 0.142
## 5 1840 William Henry Harrison Whig 0.796 0.529 0.0605
## 6 1844 James Polk Dem. 0.618 0.495 0.0145
## 7 1848 Zachary Taylor Whig 0.562 0.473 0.0479
## 8 1852 Franklin Pierce Dem. 0.858 0.508 0.0695
## 9 1856 James Buchanan Dem. 0.588 0.453 0.122
## 10 1860 Abraham Lincoln Rep. 0.594 0.396 0.101
## # ... with 39 more rows
For each year, we see the winning president, along with their share of the electoral college and their share of the popular vote. We will plot these shares against each other for each presidential victory.
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional"
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(elections_historic, mapping = aes(x = popular_pct,
y = ec_pct,
label = winner_label))
p + geom_hline(yintercept = 0.5, size = 1.4, colour = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, colour = "gray80") +
geom_point() +
geom_text_repel() +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle, caption = p_caption)
The shares are stored as proportions (between 0 and 1) rather than percentages, so we adjust the labels of the scales using the scale_x_continuous() and scale_y_continuous() functions.
We have used geom_hline() and geom_vline() to plot the horizontal and vertical lines, giving each an intercept, size and colour. You can also add a geom_abline() layer, with a slope and an intercept.
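As a sketch of the geom_abline() option, the following draws a 45-degree reference line through a scatterplot. The data frame and its values are invented for illustration and are not taken from elections_historic:

```r
library(ggplot2)

# Toy shares, invented for illustration only
toy <- data.frame(
  popular   = c(0.45, 0.52, 0.60),
  electoral = c(0.55, 0.70, 0.90)
)

# slope = 1 and intercept = 0 give the line where the two shares are equal
p_line <- ggplot(data = toy, mapping = aes(x = popular, y = electoral)) +
  geom_abline(slope = 1, intercept = 0, colour = "gray80") +
  geom_point()

print(p_line)
```

Points above the line are cases where the share on the y-axis exceeds the share on the x-axis.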
Sometimes we want to pick out some points of interest in the data without labelling every single item. We will still use geom_text_repel(), but we will give it its own data argument, using subset() to pick out part of the data plotted by geom_point().
p <- ggplot(data = by_country, mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point() +
geom_text_repel(data = subset(by_country, gdp_mean > 25000),
mapping = aes(label = country))
p <- ggplot(data = by_country, mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point() +
geom_text_repel(data = subset(by_country,
gdp_mean > 25000 | health_mean < 1500 | country %in% "Belgium"),
mapping = aes(label = country))
We have two plots, with two subsets of the data being passed to geom_text_repel(). The subset() function takes the by_country object, and selects only cases where a logical expression is met.
In the first case, only countries where gdp_mean is over 25,000 are labelled.
In the second case, the only countries labelled are those where gdp_mean is over 25,000 or health_mean is less than 1,500 or the country is Belgium.
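The logic of the two subsets can be checked without drawing anything. This base R sketch uses a four-row toy table whose values are rounded from the by_country output printed earlier:

```r
# Four rows copied (rounded) from the by_country table shown earlier
toy <- data.frame(
  country     = c("Belgium", "Norway", "Spain", "Italy"),
  gdp_mean    = c(22500, 26448, 16933, 21554),
  health_mean = c(1958, 2217, 1289, 1757),
  stringsAsFactors = FALSE
)

# First case: a single condition
first <- subset(toy, gdp_mean > 25000)

# Second case: a row is kept if any one of the three conditions holds
second <- subset(toy, gdp_mean > 25000 | health_mean < 1500 | country %in% "Belgium")

print(first$country)
print(second$country)
```

Italy meets none of the conditions, so it is plotted as a point but never labelled.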
Alternatively, we can pick out specific points using a specially-created dummy variable. We will create a new column in organdata called ind which will be coded as TRUE if ccode is “Ita” or “Spa” and if year is greater than 1998.
Then we use ind in two ways in the plot. First, we map it to the colour aesthetic. Second, we use it to subset the data that will be labelled. We will also use the guides() function to remove the legends that would otherwise appear.
# year is stored as a Date, so compare it with a Date rather than a bare number
organdata$ind <- organdata$ccode %in% c("Ita", "Spa") & organdata$year > as.Date("1998-12-31")
p <- ggplot(data = organdata,
mapping = aes(x = roads, y = donors, colour = ind))
p + geom_point() +
geom_text_repel(data = subset(organdata, ind),
mapping = aes(label = ccode)) +
guides(label = "none", colour = "none")
## Warning: Removed 34 rows containing missing values (geom_point).
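One thing to check when building an indicator from a date column: R compares a Date with a bare number using the days elapsed since 1970-01-01, not the calendar year. A small base R sketch with made-up dates (not the organdata values) shows the difference, along with one way to compare on the calendar year:

```r
# Made-up dates, for illustration only
year <- as.Date(c("1991-01-01", "1998-01-01", "1999-01-01", "2002-01-01"))

# Against a bare number, a Date is treated as a count of days since 1970-01-01,
# so every one of these dates counts as "greater than 1998"
naive <- year > 1998

# Extracting the calendar year first gives the intended comparison
by_calendar_year <- as.integer(format(year, "%Y")) > 1998

print(naive)
print(by_calendar_year)
```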
Use annotate() to annotate the plot directly. It isn’t a geom but it uses geoms. We will use a text geom, so that annotate() can access x, y and label, as well as size, colour, hjust and vjust. We will use \n as a newline break.
p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors))
p + geom_point() +
annotate(geom = "text",
x = 91, y = 33,
label = "A surprisingly high \n recovery rate.",
hjust = 0)
## Warning: Removed 34 rows containing missing values (geom_point).
annotate() can work with other geoms, such as rectangles, line segments and arrows. Let’s add a rectangle:
p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors))
p + geom_point() +
    annotate(geom = "rect",
             xmin = 125, xmax = 155,
             ymin = 30, ymax = 35,
             fill = "red", alpha = 0.2) +
    annotate(geom = "text",
             x = 91, y = 33,
             label = "A surprisingly high \n recovery rate.",
             hjust = 0)
## Warning: Removed 34 rows containing missing values (geom_point).
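Line segments and arrows work the same way. The sketch below is an illustration rather than part of the original analysis: it uses the built-in mtcars data in place of organdata so that it runs on its own, and the coordinates and label text are made up. The arrowhead comes from arrow() in the grid package, which ships with R.

```r
library(ggplot2)

# mtcars ships with R; it stands in for organdata so the sketch is self-contained.
p_arrow <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  # A line segment from (x, y) to (xend, yend), with an arrowhead at the far end.
  annotate(geom = "segment",
           x = 4.5, xend = 5.2,
           y = 25, yend = 12,
           arrow = grid::arrow(length = grid::unit(0.25, "cm"))) +
  # A text label placed just above the start of the segment.
  annotate(geom = "text",
           x = 4.5, y = 26,
           label = "Heavy cars,\nlow mileage")

p_arrow
```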
We have extended our ggplot vocabulary, having introduced the scale_ functions, the guides() function and the theme() function. Let’s learn more.
Every aesthetic mapping has a scale. If you want to adjust it, use scale_ functions.
Many scales come with a legend or key. These are called guides. If you want to adjust them (for example, to make them disappear), use the guides() function.
To adjust features of the graph not connected to the logical structure of the data (such as background colours, typefaces or the positioning of legends), use the theme() function.
Scales and guides are closely connected. Guides provide information about the scale, such as in a legend or a colour bar. Thus it is possible to make adjustments to guides from inside the various scale_ functions, although it is often easier to use the guides() function directly.
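As a minimal sketch of these two routes, the code below hides the same colour legend once from inside a scale_ function and once with guides(). It uses the built-in mtcars data as a stand-in so that it runs without organdata; the object names are ours.

```r
library(ggplot2)

# mtcars stands in for organdata; factor(cyl) gives a discrete colour scale.
base <- ggplot(data = mtcars,
               mapping = aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Route 1: adjust the guide from inside the scale function.
p_scale <- base + scale_colour_discrete(guide = "none")

# Route 2: use guides() directly.
p_guides <- base + guides(colour = "none")
```

Both plots should come out without a colour legend. The guides() version is shorter when hiding the legend is the only change you want to make to the scale.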
Let’s look at an example:
p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors,
                          colour = world))
p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).
There are three aesthetic mappings. roads maps to x, donors maps to y and world maps to colour.
The x and y scales are both continuous, ranging smoothly from just under the lowest value to just over the highest value. The colour mapping also has a scale. world is an unordered, categorical variable; so it has a discrete scale. It has four values, so it is represented by four colours.
Mappings like fill, shape and size also have scales. We could have mapped world to shape, in which case our four-category variable would have a scale consisting of four different shapes. We can still adjust these scales using the scale_ functions.
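Here is a small sketch of a shape mapping adjusted through its scale_ function. It uses the built-in mtcars data instead of organdata, so the categorical variable has three categories rather than four, and the particular shapes chosen are only an example.

```r
library(ggplot2)

# mtcars stands in for organdata; factor(cyl) has three categories.
p_shape <- ggplot(data = mtcars,
                  mapping = aes(x = wt, y = mpg, shape = factor(cyl))) +
  geom_point(size = 2) +
  # Choose the shapes by hand: 16 = filled circle, 17 = filled triangle,
  # 15 = filled square.
  scale_shape_manual(values = c(16, 17, 15)) +
  labs(shape = "Cylinders")

p_shape
```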
Many different kinds of variables can be mapped. Usually, x and y are continuous measures. But they can be discrete, as when we mapped country names to the y-axis in our boxplots and dotplots. An x or y mapping can also be defined as a transformation onto a log scale, or a special sort of number like a date.
Similarly a colour or fill mapping can be discrete and unordered (as with our world variable) or discrete and ordered (as with letter grades in an exam). A colour or fill mapping can also be a continuous quantity, represented as a colour gradient running smoothly from a low to a high value. Finally, both continuous gradients and ordered discrete values might have some defined midpoint with extremes diverging in both directions.
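The continuous and diverging cases can be sketched as follows. The example again uses the built-in mtcars data as a stand-in, and the colours and the choice of the median as the midpoint are arbitrary assumptions made for illustration.

```r
library(ggplot2)

# Continuous colour mapping: a gradient running from a low to a high value.
p_grad <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, colour = hp)) +
  geom_point() +
  scale_colour_gradient(low = "grey80", high = "darkred")

# Diverging colour mapping: two gradients that meet at a defined midpoint.
p_div <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, colour = hp)) +
  geom_point() +
  scale_colour_gradient2(low = "blue", mid = "grey90", high = "red",
                         midpoint = median(mtcars$hp))
```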
We have a different scale_ function for each mapping and scale. They are named according to a consistent logic:
scale_<MAPPING>_<KIND>()
For example, scale_x_continuous() controls x scales for continuous variables. scale_y_discrete() controls y scales for discrete variables. scale_x_log10() transforms an x mapping to a log scale.
If you want to adjust the labels or tick-marks on a scale, you will need to know which mapping it is for and what sort of scale it is. Then you supply the arguments to the appropriate scale function. For example, let’s change the x-axis of the previous plot to a log scale and then also change the position and labels of the tick-marks on the y-axis:
p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors, colour = world))
p + geom_point() +
    scale_x_log10() +
    scale_y_continuous(breaks = c(5, 15, 25),
                       labels = c("Five", "Fifteen", "Twenty Five"))
## Warning: Removed 34 rows containing missing values (geom_point).
The same applies to mappings like colour and fill.
When working with a scale that produces a legend, we can also use its scale_ function to specify the labels in the key. To change its title, however, we use the labs() function, which lets us label all the mappings.
p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors, colour = world))
p + geom_point() +
    scale_colour_discrete(labels = c("Corporatist", "Liberal",
                                     "Social Democratic", "Unclassified")) +
    labs(x = "Road Deaths", y = "Donor Procurement", colour = "Welfare State")
## Warning: Removed 34 rows containing missing values (geom_point).
We have already seen that we can move the legend using theme(legend.position = "top"). We can also make the legend disappear with guides(colour = "none"), which is written as guides(colour = FALSE) in older versions of ggplot2.
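As a short sketch of moving the legend, the code below places the same legend at the top and at the bottom of the plot. It uses the built-in mtcars data as a stand-in for organdata, and the object names are ours.

```r
library(ggplot2)

# mtcars stands in for organdata; factor(cyl) produces a colour legend.
base <- ggplot(data = mtcars,
               mapping = aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(colour = "Cylinders")

# Legend above the plot panel.
p_top <- base + theme(legend.position = "top")

# Legend below the plot panel.
p_bottom <- base + theme(legend.position = "bottom")
```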